Assignment (part 1, Unsupervised)

Exploration of the dataset. Descriptive statistics.

Data Preparation

In the data preprocessing part, we will mainly:

find missing values and fill them

change the necessary data types
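A minimal sketch of these two steps in pandas. The column names (Sum, Code, Type) and fill strategies (median for numeric, mode for categorical) are assumptions standing in for the real dataset:

```python
import pandas as pd
import numpy as np

# Toy frame; the column names are hypothetical stand-ins for the real data
df = pd.DataFrame({
    "Sum": [100.0, np.nan, 250.0, 80.0],
    "Code": ["4814", "6011", None, "6011"],
    "Type": [1010, 2010, 2010, 1010],
})

# Fill missing numeric values with the median, categorical with the mode
df["Sum"] = df["Sum"].fillna(df["Sum"].median())
df["Code"] = df["Code"].fillna(df["Code"].mode()[0])

# Cast the code/type columns to a categorical dtype
df["Code"] = df["Code"].astype("category")
df["Type"] = df["Type"].astype("category")

print(df.isna().sum().sum())  # 0 missing values remain
```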

Exploratory data analysis

Does gender have an influence on the total sum?

As we can see, gender affects the sum of the transactions.
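A groupby comparison like the following is one way to check this; the toy frame and the Gender/Sum column names are assumptions, not the notebook's actual data:

```python
import pandas as pd

# Hypothetical toy frame standing in for the transaction table
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F", "M"],
    "Sum": [120.0, 80.0, 200.0, 60.0, 150.0],
})

# Compare total and mean transaction sums per gender
totals = df.groupby("Gender")["Sum"].agg(["sum", "mean"])
print(totals)
```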

Which transaction amount type is the most common?

Transactions made through the mobile application are the most common.

Which transaction amount code is the most common?

Most of the transactions were carried out under the code Financial Institutions - manual cash withdrawal.

Visualisation

Let's plot the distribution of the transaction sum.
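A minimal histogram sketch with matplotlib. The lognormal sample below is a synthetic stand-in for the real `df["Sum"]` column:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for the transaction sums
sums = rng.lognormal(mean=4.0, sigma=1.0, size=1000)

fig, ax = plt.subplots()
counts, bins, _ = ax.hist(sums, bins=50)
ax.set_xlabel("Sum")
ax.set_ylabel("Frequency")
fig.savefig("sum_distribution.png")
print(int(counts.sum()))  # all 1000 points fall inside the bins
```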

Number of transactions by code and type descriptions

Graph of the frequency of the top 20 transaction types. As we can see from the graph, POS purchase (Покупка POS) and ATM cash withdrawal (выдача наличных в ATM) are the most popular ones.

Graph of the frequency of the top 20 transaction codes. As we can see from the graph, Financial Institutions: automatic cash withdrawal (Финансовые институты — снятие наличности автоматически) and Financial Institutions: manual cash withdrawal (снятие наличности вручную) are the most popular ones.
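The top-20 counts behind such a bar chart come from `value_counts()`. A sketch with made-up type labels (the real labels are the dataset's descriptions):

```python
import pandas as pd

# Toy stand-in for the "Type" description column
types = pd.Series(["POS purchase", "ATM withdrawal", "POS purchase",
                   "Transfer", "POS purchase", "ATM withdrawal"])

# Frequency of the most common transaction types (top 20 on real data)
top = types.value_counts().head(20)
print(top)
# On real data: top.plot(kind="bar") draws the frequency chart
```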

Feature engineering

In this part we will prepare the data for model training.

We have code and type descriptions, but we cannot use them as-is because they are categorical features. We will simply drop them, since encoding them would have only a slight effect on the modeling results.

Feature scaling - standardizing the data
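A sketch of both steps (dropping the categorical description columns, then standardizing). The column names are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names are assumptions about the dataset
df = pd.DataFrame({
    "Sum": [100.0, 250.0, 80.0, 300.0],
    "Code_desc": ["a", "b", "a", "c"],  # categorical, dropped below
    "Type_desc": ["x", "y", "x", "z"],  # categorical, dropped below
})

# Drop the categorical description columns, keep the numeric features
X = df.drop(columns=["Code_desc", "Type_desc"])

# Standardize: zero mean, unit variance per column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # approximately 0 per column
```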

Supervised learning

Modeling

KNN

The algorithm doesn't predict any better than a random draw would. Let's find the k with the lowest error rate through iterations.

The intuition behind choosing the best value of k is beyond the scope of this article, but we should know that we can determine the optimum value of k when we get the highest test score for that value. For that, we can evaluate the training and testing scores for up to 40 nearest neighbors:

The error rate is what we want to minimize, so we want to know the k that gives the smallest error rate. Let's create a visual representation to make life easier.
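The search loop might look like the following sketch. The synthetic `make_classification` data stands in for the scaled transaction features; only the loop structure is the point here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled Sum/Code/Type features
X, y = make_classification(n_samples=400, n_features=3, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Error rate for k = 1..40; pick the k with the smallest test error
error_rates = []
for k in range(1, 41):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(np.mean(knn.predict(X_test) != y_test))

best_k = int(np.argmin(error_rates)) + 1
print(best_k, min(error_rates))
```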

An error rate of 0.42 is very high, but it is the best we're able to find. Let's now rerun the model with k=8 instead of k=3.

A confusion matrix helps us gain an insight into how correct our predictions were and how they hold up against the actual values.
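A small sketch of how the matrix is built with scikit-learn; the toy labels below are illustrative, the real ones would come from the fitted KNN model:

```python
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels standing in for the model's output
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are actual classes, columns are predicted classes
print(cm)
```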

We were able to classify a couple more points correctly, but in general an accuracy score of 0.6 is not good. It looks like we'd need more data (more features or a larger dataset) to build a more robust model.

Model Analysis

For any machine learning model, achieving a 'good fit' is crucial. This involves striking the balance between underfitting and overfitting, or in other words, a tradeoff between bias and variance. A useful tool when predicting the probability of a binary outcome is the ROC curve. It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate. The area bounded by the curve and the axes is called the Area Under the Curve (AUC), and it is this area that serves as a metric of model quality. With this metric ranging from 0 to 1, we should aim for a high AUC. Models with a high AUC are said to have good skill.
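The ROC points and AUC come from `roc_curve` and `roc_auc_score`; a sketch with toy scores (in the notebook these would be the model's `predict_proba` outputs):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and scores standing in for predict_proba output
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(round(auc, 2))  # 0.75 for this toy example
```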

The precision-recall curve is a direct representation of precision (y-axis) against recall (x-axis).
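The curve's points come from `precision_recall_curve`; again a sketch with toy scores rather than the notebook's actual predictions:

```python
from sklearn.metrics import precision_recall_curve

# Toy labels and scores standing in for predict_proba output
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# recall runs from 1 down to 0; the last precision is 1 by convention
print(recall[0], precision[-1])
```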

Random Forest

Let's try the Random Forest algorithm instead. We have already scaled the data and split it into train and test sets.

Still nothing better than a random draw. Let's instead use grid search to find the best parameter values. Parameter tuning is the process of selecting the values for a model's parameters that maximize its accuracy.
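A grid search sketch with `GridSearchCV`. The parameter grid below is a small illustrative one, not the notebook's actual grid, and the synthetic data stands in for the real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled transaction features
X, y = make_classification(n_samples=200, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)

# Small illustrative grid; a real search would cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```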

Slightly better than before, but still not noticeably different from a random draw.

ROC Curve for Random Forest

Precision Recall Curve of Random Forest

SVM

Let's train the algorithm again, using the information from the grid search.
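A sketch of refitting an SVM with chosen hyperparameters. The `C`, `kernel`, and `gamma` values below are placeholders for the grid-search winners, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the scaled transaction features
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hyperparameters are placeholders for the grid-search results;
# probability=True enables predict_proba for the ROC/PR curves
svm = SVC(C=1.0, kernel="rbf", gamma="scale", probability=True)
svm.fit(X_train, y_train)
print(round(svm.score(X_test, y_test), 3))
```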

ROC Curve of SVM

Precision Recall Curve of SVM

Decision Tree

Let's create a Decision Tree Model using Scikit-learn.

Let's estimate how accurately the classifier can predict the customer's gender. Accuracy can be computed by comparing the actual test-set values with the predicted values.

We get a classification rate of 75.85%, which is decent accuracy. We can try to improve it by tuning the parameters of the decision tree algorithm.

In scikit-learn, optimization of the decision tree classifier is performed only by pre-pruning. The maximum depth of the tree can be used as a control variable for pre-pruning. In the following example, we fit a decision tree on the same data with max_depth=100. Besides the pre-pruning parameters, we also try another attribute selection measure, entropy.
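A sketch of a pre-pruned tree with the entropy criterion. The synthetic data and the `max_depth=5` value are illustrative choices, not the notebook's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the scaled transaction features
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Pre-pruning via max_depth; entropy as the attribute-selection measure
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                              random_state=7)
tree.fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
print(round(acc, 3))
```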

ROC Curve of Decision Tree
Precision Recall Curve of Decision Tree

Conclusion

None of KNN, Random Forest, Decision Tree, or SVM is very useful for predicting the gender of the customer based on the features Sum, Code and Type. This indicates that the data has little predictive capability. This doesn't come as a huge surprise, as we could already see in the EDA that there was little to suggest any major differences between the two genders in these variables.

This work proposes a method that predicts users' gender based on their transaction history. This dataset is probably best suited for unsupervised ML techniques, but I was curious to see whether there are attributes that can help predict whether a customer is male or female. It's a small dataset (both in terms of features and number of customers), but I think this notebook gives a useful introduction to applying ML algorithms such as Random Forest, SVM, Decision Tree and KNN.

The best accuracy, 0.77, was achieved with the Decision Tree model.

A high error rate indicates that the model is underfitting and has high bias. The model is not sufficiently complex, so it's simply not capable of representing the relationship between y and the input features. To combat this we could try increasing the number of input features.